1 The condition of Tanzanian water access points

Water is life. This is especially true where it is scarce, as in large parts of Africa. To provide people with fresh water, organizations build water pumps, but they often do not pay for further maintenance, so the pumps break down and become useless. The online platform Taarifa collects data on water pumps in Tanzania and wants to predict which ones are broken or will soon break down, so that maintenance can be organized. The data science competition platform DrivenData hosts a challenge in which the community can help with the prediction effort (http://www.drivendata.org/competitions/7/page/23/). The data used in this report is the training data provided for the challenge.

2 Data Preparation and Univariate Analysis

As a first step the variables given in the data set will be inspected:

##  [1] "amount_tsh"            "basin"                
##  [3] "construction_year"     "date_recorded"        
##  [5] "district_code"         "extraction_type"      
##  [7] "extraction_type_class" "extraction_type_group"
##  [9] "funder"                "gps_height"           
## [11] "id"                    "installer"            
## [13] "latitude"              "lga"                  
## [15] "longitude"             "management"           
## [17] "management_group"      "num_private"          
## [19] "payment"               "payment_type"         
## [21] "permit"                "population"           
## [23] "public_meeting"        "quality_group"        
## [25] "quantity"              "quantity_group"       
## [27] "recorded_by"           "region"               
## [29] "region_code"           "scheme_management"    
## [31] "scheme_name"           "source"               
## [33] "source_class"          "source_type"          
## [35] "status_group"          "subvillage"           
## [37] "ward"                  "water_quality"        
## [39] "waterpoint_type"       "waterpoint_type_group"
## [41] "wpt_name"

The list of variables is quite long. To structure the analysis a bit, they were sorted into classes:

  1. Geographical location
  2. Water point properties
  3. Water point management
  4. Data acquisition related

3 Descriptive Statistics

3.2 Geographical location

3.2.1 basin

##           Lake Victoria                 Pangani                  Rufiji 
##                   10248                    8940                    7976 
##                Internal         Lake Tanganyika             Wami / Ruvu 
##                    7785                    6432                    5987 
##              Lake Nyasa Ruvuma / Southern Coast              Lake Rukwa 
##                    5085                    4493                    2454

The basin variable gives the geographical location of the water point and, in some cases, probably also the source of the water, for example for Lake Victoria. With only nine levels, the danger of overfitting should be minimal. Since basins are defined by natural water bodies, they might directly influence the functionality of the water points.

3.2.2 gps_height

##    Max. 3rd Qu.    Mean  Median 1st Qu.    Min. 
##  2770.0  1319.0   668.3   369.0     0.0   -90.0

Many entries have a value of 0 (even the first quartile is 0), which most likely encodes missing values (NAs). They were removed for the following plot:

## [1] "Min.   : -90  " "Median :1167  " "Mean   :1019  " "Max.   :2770  "
## [5] "3rd Qu.:1498  " "1st Qu.: 393  "

The height at which a water point is situated might influence its functionality, e.g. through the available water sources or the climate. The span of heights is quite large, ranging from -90 m to 2770 m. The large number of water points at 0 m is probably due to missing values, although 0 m could in theory also be the real height.
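The treatment of zeros as missing values can be sketched as follows. This is a minimal illustration on an invented toy vector, not the code used for the report; the assumption that 0 always means "missing" is exactly the one discussed above.

```r
# Toy stand-in for the gps_height column (values are invented).
gps_height <- c(0, 0, 1200, -90, 2770, 0, 393)

# Assumption: a value of exactly 0 is a missing-value code.
gps_height_clean <- ifelse(gps_height == 0, NA, gps_height)

# summary() now reports the NAs separately from the quantiles.
print(summary(gps_height_clean))

# Share of entries that were coded as 0.
zero_share <- mean(gps_height == 0)
```

Reporting `zero_share` next to the cleaned summary makes it visible how much of the variable is affected by the recoding.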

3.2.3 region

##      Iringa   Shinyanga       Mbeya Kilimanjaro    Morogoro      Arusha 
##        5294        4982        4639        4379        4006        3350 
##      Kagera      Mwanza      Kigoma      Ruvuma 
##        3316        3102        2816        2640

The regions are the coarsest administrative unit in Tanzania. Since they are political entities, differences in policy between regions, e.g. subsidies, might also influence the functionality of water points.

3.2.4 region_code

##    Max. 3rd Qu.    Mean  Median 1st Qu.    Min. 
##    99.0    17.0    15.3    12.0     5.0     1.0

The region code should be just a coded version of the region variable, which would make it redundant. Since regions are categorical rather than continuous, the non-coded version of the variable might be better suited. However, there are more unique region codes (27) than region names (21) in the data set, so there might be faulty data.
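A mismatch like this can be narrowed down by counting how many distinct codes occur per region name. The sketch below uses an invented toy data frame (`wells`, with placeholder region names and codes); only the column names region and region_code are taken from the data set.

```r
# Toy data; region names and codes are placeholders, not real values.
wells <- data.frame(
  region = c("A", "A", "B", "B", "C", "C"),
  region_code = c(1, 11, 2, 2, 3, 33),
  stringsAsFactors = FALSE
)

# Number of distinct codes and distinct names.
n_codes <- length(unique(wells$region_code))
n_names <- length(unique(wells$region))

# Distinct codes per region name; more than one points to inconsistent coding.
codes_per_region <- tapply(wells$region_code, wells$region,
                           function(x) length(unique(x)))
ambiguous <- names(codes_per_region)[codes_per_region > 1]
print(ambiguous)
```

Regions listed in `ambiguous` would be the first candidates to inspect for faulty entries.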

3.2.5 district_code

##    Max.    Mean 3rd Qu.  Median 1st Qu.    Min. 
##   80.00    5.63    5.00    3.00    2.00    0.00

Districts are the next smaller administrative unit in Tanzania. There are 169 districts in total, but only 20 appear in the data set. This hints at either a misleading feature label or faulty data.

3.2.6 lga

##      (Other)       Njombe Arusha Rural  Moshi Rural      Bariadi 
##         3437         2503         1252         1251         1177 
##       Rungwe       Kilosa       Kasulu        Mbozi         Meru 
##         1106         1094         1047         1034         1009

The local government authority (lga) is a level of government below the regions. Most of the time the authorities should coincide with the districts. This variable might therefore be redundant to district_code, but again there is a discrepancy in the number of unique values.

3.2.7 ward

##   (Other)     Igosi  Imalinyi Siha Kati    Mdandu   Nduruma   Kitunda 
##     47666       307       252       232       231       217       203 
##   Mishamo    Msindo  Chalinze 
##       203       201       196

Wards are an even smaller administrative unit comprising up to 21000 people. The number of levels in this feature is huge, which might interfere with modeling. There seems to be a substantial difference in the number of water points between wards, although wards are delimited by population. Overuse might therefore occur in some wards.

3.2.8 subvillage

##  (Other) Madukani  Shuleni  Majengo     Kati           Mtakuja   Sokoni 
##    50316      508      506      502      373      371      262      232 
##        M Muungano 
##      187      172
##              Isanga Mtaa Wa Kipunguni B            Mwangaza 
##                  34                  35                  35 
##          Njia Panda           Njiapanda             Tankini 
##                  35                  35                  35 
##            Chemchem            Kijijini           Mchangani 
##                  36                  36                  36 
##              Temeke 
##                  36

Subvillage is the finest administrative unit and contains even more factor levels, so many that some bars cannot be visualized in the plot, such as the one for Madukani, which would be the highest with a count of 508. This variable should behave similarly to the ward variable.

3.2.9 latitude and longitude

##       Max.    3rd Qu.     Median       Mean    1st Qu.       Min. 
## -2.000e-08 -3.326e+00 -5.022e+00 -5.706e+00 -8.541e+00 -1.165e+01
##    Max. 3rd Qu.  Median    Mean 1st Qu.    Min. 
##   40.35   37.18   34.91   34.08   33.09    0.00

The coordinates encode the geographic location with the highest resolution in this data set. Unlike the other location variables, they are continuous rather than discrete. A value of 0 again represents NAs, since neither the prime meridian nor the equator runs through Tanzania; these values were not plotted. The peak at a latitude of about -3 is probably there because it includes the southern shore of Lake Victoria, the Serengeti National Park and the Kilimanjaro National Park, which are highly populated and attract many tourists.

3.3 Water point properties

3.3.1 status_group

##              functional          non functional functional needs repair 
##                   32259                   22824                    4317

Slightly more than half of the water points are functional (54.3 %). About 38.4 % are non functional, which is quite a large fraction. A considerably smaller share is still functional but needs repair (7.3 %).
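The percentages follow directly from the counts printed above. As a small worked example, using only those three counts:

```r
# Counts copied from the status_group table above.
status_counts <- c("functional" = 32259,
                   "non functional" = 22824,
                   "functional needs repair" = 4317)

# Convert counts to percentages of all water points, rounded to one decimal.
status_pct <- round(100 * status_counts / sum(status_counts), 1)
print(status_pct)
```

On a full data frame the same numbers could be obtained with `prop.table(table(...))` on the status_group column.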

3.3.2 amount_tsh

##     Max.     Mean  3rd Qu.     Min.  1st Qu.   Median 
## 350000.0    317.7     20.0      0.0      0.0      0.0

The total static head (TSH) is a rather technical measure: it describes the work a pump must perform to deliver water to the surface. Most values are given as zero. Since the value is so technical, it is probably only measured at water points that are regularly maintained. The high number of zeros, which are probably NAs, might thus be indicative of poor maintenance. However, TSH is only needed for water points operated by a pumping mechanism. Since only a fraction of water points use pumps, this also accounts for some of the NAs. To explore the available data without NAs, it is plotted without zeros and on a log scale:

## [1] "1st Qu.:    50.0  " "3rd Qu.:  1000.0  " "Max.   :350000.0  "
## [4] "Mean   :  1062.4  " "Median :   250.0  " "Min.   :     0.2  "

If log-transformed, the distribution roughly resembles a normal distribution, but some values are over- or underrepresented. This might indicate an outside influence on the recorded TSH values of the pumps.
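A minimal sketch of this preparation step, on an invented toy vector, could look as follows. Dropping the zeros before taking the logarithm matters because `log10(0)` is `-Inf`; the binning via `hist()` is an assumption about how the plot was produced.

```r
# Toy stand-in for amount_tsh (values are invented for illustration).
amount_tsh <- c(0, 0, 0, 0.2, 50, 250, 1000, 350000)

# Drop zeros (assumed to be NAs) before log-transforming.
tsh_positive <- amount_tsh[amount_tsh > 0]
tsh_log10 <- log10(tsh_positive)

# Bin the transformed values; set plot = TRUE to draw the histogram.
h <- hist(tsh_log10, breaks = 10, plot = FALSE)
print(h$counts)
```

Bins with unexpectedly high or low counts in `h$counts` correspond to the over- or underrepresented values mentioned above.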

3.3.3 construction_year

##    Max. 3rd Qu.  Median    Mean    Min. 1st Qu. 
##    2013    2004    1986    1301       0       0

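The number of water points built each year seems to be growing, although it could also be that the oldest ones have already been demolished and were thus no longer considered in the data collection. As the summary shows, at least a quarter of the entries have a construction year of 0, which presumably again encodes NAs.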

3.3.4 extraction_type*

##           gravity       nira/tanira             other       submersible 
##             26780              8154              6430              4764 
##            swn 80              mono     india mark ii           afridev 
##              3670              2865              2400              1770 
##               ksb other - rope pump 
##              1415               451
##        gravity    nira/tanira          other    submersible         swn 80 
##          26780           8154           6430           6179           3670 
##           mono  india mark ii        afridev      rope pump other handpump 
##           2865           2400           1770            451            364
##      gravity     handpump        other  submersible    motorpump 
##        26780        16456         6430         6179         2987 
##    rope pump wind-powered 
##          451          117

These three variables represent the same data at different levels of detail. In the more detailed variables (extraction_type, extraction_type_group) the granularity is not consistent across levels (e.g. india mark ii, a specific pump model, vs. gravity, a general principle), which might make them unsuitable for prediction. Using all three features does not make sense, since none should add significant information beyond the others. The variable with the least detail but the cleanest levels (extraction_type_class) might be the best choice. There seem to be only a few motorized water points. Most are operated by natural forces like gravity or wind, followed by manually operated water points.

3.3.5 water_quality and quality_group

##               soft              salty            unknown 
##              50818               4856               1876 
##              milky           coloured    salty abandoned 
##                804                490                339 
##           fluoride fluoride abandoned 
##                200                 17
##     good    salty  unknown    milky  colored fluoride 
##    50818     5195     1876      804      490      217

For the water quality there is again more than one variable, one of which was cleaned up by merging levels that may have been overly detailed. It seems that most water points actually serve good quality water.

3.3.6 quantity*

##       enough insufficient          dry     seasonal      unknown 
##        33186        15129         6246         4050          789
##       enough insufficient          dry     seasonal      unknown 
##        33186        15129         6246         4050          789

For these two variables the levels are identical, so only quantity has to be considered. The number of water points giving sufficient water (33186) is very similar to the number of functional water points (32259). A considerable number of water points give insufficient water, give water only seasonally or are completely dry.
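Identical level counts do not strictly prove that the two columns agree row by row. One way to check this is sketched below on an invented toy data frame; only the column names quantity and quantity_group come from the data set.

```r
# Toy data; the rows are invented, the column names follow the data set.
wells <- data.frame(
  quantity = c("enough", "dry", "seasonal", "enough"),
  quantity_group = c("enough", "dry", "seasonal", "enough"),
  stringsAsFactors = FALSE
)

# Row-wise comparison of the two columns.
same <- identical(as.character(wells$quantity),
                  as.character(wells$quantity_group))

# Cross table: if the columns agree, all off-diagonal cells are zero.
xt <- table(wells$quantity, wells$quantity_group)
off_diagonal <- sum(xt) - sum(diag(xt))
print(same)
```

If `same` is TRUE, one of the two columns can be dropped without losing information.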

3.3.7 source*

##               spring         shallow well          machine dbh 
##                17021                16824                11075 
##                river rainwater harvesting             hand dtw 
##                 9612                 2295                  874 
##                 lake                  dam                other 
##                  765                  656                  212 
##              unknown 
##                   66
##               spring         shallow well             borehole 
##                17021                16824                11949 
##           river/lake rainwater harvesting                  dam 
##                10377                 2295                  656 
##                other 
##                  278
## groundwater     surface     unknown 
##       45794       13328         278

For this variable the most cleaned-up version (source_class) might not be the best choice, since it only differentiates between groundwater, surface water and unknown sources. The source_type variable, on the other hand, additionally differentiates between subtypes while still being clean. Most water sources are groundwater sources rather than surface water sources, which would be expected in a dry climate.

3.3.8 waterpoint_type

##          communal standpipe                   hand pump 
##                       28522                       17488 
##                       other communal standpipe multiple 
##                        6380                        6103 
##             improved spring               cattle trough 
##                         784                         116 
##                         dam 
##                           7

Judging by the level names, the waterpoint_type variable seems to be partly redundant with source_type and extraction_type. But it only contains 7 instances of the level dam, while source_type contains 656 instances of dam. There might be faulty data in the data set. Since waterpoint_type contains a considerable number of NAs (other), the discrepancy might also be caused by missing data.

3.4 Water point management

3.4.1 funder

##                (Other) Government Of Tanzania                        
##                  11757                   9084                   3635 
##                 Danida                 Hesawa                  Rwssp 
##                   3114                   2202                   1374 
##             World Bank                   Kkkt           World Vision 
##                   1349                   1287                   1246 
##                 Unicef 
##                   1057

There are quite a lot of funders for water points, but the government funds by far the most. The sheer number of levels makes the variable useless for direct use in classification, but more useful features might be extracted from it.

3.4.2 installer

##        DWE    (Other)            Government        RWE      Commu 
##      17402      12250       3655       1825       1206       1060 
##     DANIDA       KKKT     Hesawa          0 
##       1050        898        840        777

The installer variable is similar to the funder variable. Again the government or a government department (DWE) is the biggest installer.

3.4.3 management*

##              vwc              wug      water board              wua 
##            40507             6515             2933             2535 
## private operator       parastatal  water authority            other 
##             1971             1768              904              844 
##          company          unknown 
##              685              561
## user-group commercial parastatal      other    unknown 
##      52490       3638       1768        943        561

As with some variables discussed above, two variables show the same data, one of them cleaned up. Here the less detailed variable (management_group) seems to be the better choice, since it seems to be more consistent. Most water points seem to be managed by the users themselves, which is probably not the most efficient way.

3.4.4 payment*

##             never pay        pay per bucket           pay monthly 
##                 25348                  8985                  8300 
##               unknown pay when scheme fails          pay annually 
##                  8157                  3914                  3642 
##                 other 
##                  1054
##  never pay per bucket    monthly    unknown on failure   annually 
##      25348       8985       8300       8157       3914       3642 
##      other 
##       1054

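About 43 % of the water points (never pay) can be used for free. A large part of those might just be rivers and lakes. For most of the remaining water points, users pay according to different schemes, e.g. per bucket or monthly; for 8157 entries the payment scheme is unknown.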

3.4.5 permit

##  True False       
## 38852 17492  3056

Most water points have a permit. It is unclear, however, whether this means that building the water point was permitted or that a permit is needed to collect water from it.

3.4.6 population

##    Max. 3rd Qu.    Mean  Median    Min. 1st Qu. 
## 30500.0   215.0   179.9    25.0     0.0     0.0

The population variable is quite skewed. Most water points are in an area with very low population. When plotting after removing the zeros and using a logarithmic scale, the plot shows that there are two distinct groups:

## [1] "1st Qu.:   40.0  " "3rd Qu.:  324.0  " "Max.   :30500.0  "
## [4] "Mean   :  281.1  " "Median :  150.0  " "Min.   :    1.0  "

One group of water points has only very few people (< 10) living around them. For the other water points the distribution roughly follows a normal distribution.

3.4.7 public_meeting

##  True False       
## 51011  5055  3334

In most places there seem to be public meetings.

3.4.8 scheme*

##                                                 (Other) 
##                       28166                       19572 
##                           K                        None 
##                         682                         644 
##                    Borehole               Chalinze wate 
##                         546                         405 
##                           M                      DANIDA 
##                         400                         379 
##                  Government Ngana water supplied scheme 
##                         320                         270
##              VWC              WUG                   Water authority 
##            36793             5206             3877             3153 
##              WUA      Water Board       Parastatal Private operator 
##             2883             2748             1680             1063 
##          Company            Other 
##             1061              766

The scheme_name variable contains a lot of levels, which cannot be plotted well and probably carry little information. The overall scheme categories are summarized in the scheme_management variable. The most common scheme is the Village Water Committee (VWC), which fits well with the management_group variable, according to which most water points are managed by user groups.

3.5 Summary (Univariate Analysis)

The most important observation from the univariate analysis is probably that the data is in need of some cleaning. Many variables are useless for prediction algorithms because they have too many unique values, are redundant or are unique to each entry. Of the data acquisition related variables, we can remove id, num_private, recorded_by and wpt_name. Of the geographical location variables, we will be able to remove region_code, district_code, ward and subvillage. The lga variable could be interesting, but it might be partly redundant with the region variable, so only one of the two should be chosen. Since lga has 125 unique categorical values, which cannot be handled by R implementations of algorithms like random forest, it will be removed as well. The water point properties contain some variables that are apparently intermediates of previous data cleanup. Of each such group only one variable should be used. Some considerations were already described above.
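Screening for categorical variables with too many levels can be sketched as below. The data frame is an invented toy example, and the threshold of 53 levels is an assumption about the limit of the randomForest package for categorical predictors; other implementations may differ.

```r
# Toy data with one low-cardinality and one high-cardinality column.
n <- 60
wells <- data.frame(
  id = seq_len(n),
  region = rep(c("A", "B", "C"), length.out = n),
  lga = paste0("lga_", seq_len(n)),
  stringsAsFactors = FALSE
)

# Keep only the categorical (character) columns.
categorical <- wells[sapply(wells, is.character)]

# Count distinct values per categorical column.
n_levels <- sapply(categorical, function(col) length(unique(col)))

# Assumed limit for categorical predictors (randomForest package).
max_levels <- 53
too_many <- names(n_levels)[n_levels > max_levels]
print(too_many)
```

Columns listed in `too_many` would have to be dropped, grouped into coarser levels or replaced by derived features before fitting such a model.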

The same is true for the water point management variables. Additionally, the funder and installer variables contain too many unique values and will be removed. In their place, new and more usable variables will be created to model the experience of the respective funder or installer: the number of wells funded or installed by an organization will be used as an experience score.

##    Max.    Mean 3rd Qu.  Median 1st Qu.    Min. 
##    9084    1942    1374     470      78       1

## 3rd Qu.    Max.    Mean  Median 1st Qu.    Min. 
##   17400   17400    5351     408      71       1

Both plots look pretty similar. Most water points are funded and installed by organizations with little ‘experience’, but in both cases there is a huge outlier, which would probably be the government as discussed for the original variables.
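The experience score described above amounts to replacing each funder (or installer) name by the number of rows that carry the same name. A minimal sketch on invented toy data follows; the column name funder_count matches the variable used later in the report, while the way it is computed here is an assumption.

```r
# Toy data; funder names are placeholders.
wells <- data.frame(
  funder = c("Gov", "Gov", "Gov", "OrgA", "OrgA", "OrgB"),
  stringsAsFactors = FALSE
)

# For every row, count how many rows share the same funder.
# ave() applies FUN within each group and returns one value per row.
wells$funder_count <- ave(rep(1L, nrow(wells)), wells$funder, FUN = sum)
print(wells)
```

Note that blank funder names would form a group of their own under this approach, so they may need to be recoded as NA beforehand.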

The variable most interesting as a dependent variable is status_group, which is also the label in the contest the data was taken from, since predicting the functional status of a water point would allow maintenance to be organized better. But there might be other interesting connections in the data. For example, the water quality could be related to the quantity of water, or regional differences could influence the other variables.

4 Bivariate Analysis

4.1 Dependency of status_group

General note on plots: The categorical variables will be plotted as stacked bar plots, where the color indicates the functionality, the count of water points per level is plotted on y and the variable value on x. Both the absolute and the relative distribution of counts will be plotted. In the plots containing the relative values, a black line indicates the relative number of functional water points in the whole data set. It serves as a reference for whether water points in certain categories perform worse or better than the overall average. This approach has a caveat: if a category or bin contains comparatively few entries, the fraction of functional water points might not be representative. This happens quite often in this data set, illustrating the inhomogeneity and skewness of the data.
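The quantities behind such a plot can be sketched in base R as follows. The data frame is an invented toy example, and `plot_relative_status` is a hypothetical helper written for this illustration, not the plotting code used for the report.

```r
# Toy data; level names and statuses are invented for illustration.
wells <- data.frame(
  basin = c("A", "A", "A", "B", "B", "C"),
  status_group = c("functional", "non functional", "functional",
                   "non functional", "non functional", "functional"),
  stringsAsFactors = FALSE
)

# Absolute counts: one row per status, one column per level.
counts <- table(wells$status_group, wells$basin)

# Relative distribution within each level (every column sums to 1).
rel <- prop.table(counts, margin = 2)

# Reference value: share of functional water points in the whole data set.
overall_functional <- mean(wells$status_group == "functional")

# Entries per level; shares for levels with few entries are unreliable.
n_per_level <- colSums(counts)

# Hypothetical helper: stacked relative bars plus the reference line.
plot_relative_status <- function(rel, reference) {
  barplot(rel, col = c("darkgreen", "firebrick"),
          legend.text = rownames(rel),
          xlab = "basin", ylab = "fraction of water points")
  abline(h = reference, lwd = 2)  # black line: overall functional share
}
```

Calling `plot_relative_status(rel, overall_functional)` draws the relative plot; inspecting `n_per_level` alongside it addresses the caveat about sparsely populated categories.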

4.1.1 date_recorded

## [1] "non functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = 0.11458, df = 342, p-value = 0.9088
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09961056  0.11186387
## sample estimates:
##         cor 
## 0.006195927 
## 
## [1] "functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = -0.54868, df = 342, p-value = 0.5836
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.13497413  0.07632441
## sample estimates:
##         cor 
## -0.02965616 
## 
## [1] "functional needs repair"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = 1.4647, df = 342, p-value = 0.1439
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02701004  0.18316866
## sample estimates:
##       cor 
## 0.0789567

There is no major change in the status group fractions over the recording period: all three correlations are close to zero, with p-values above 0.1. This means that there is no bias due to the recording time and that there were no rare events causing unusual changes, which could lead to outliers negatively influencing prediction efforts. It also means that the date_recorded variable would not add much information to a model and is probably not useful.

4.1.2 region

While in most regions the fraction of functional water points is close to the overall mean fraction, the regions Lindi and Mtwara and, to a lesser extent, Mara, Ruvuma and Tabora perform worse. On the other hand, Arusha, Iringa, Kilimanjaro and Manyara have an above-average share of functioning water points. This indicates that there could be political or geographical factors influencing the status of water points.

4.1.3 basin

Like the region variable, the basin variable contains levels performing better or worse than the overall average. Interestingly, the Lake Rukwa and Ruvuma / Southern Coast basins contain data entries from regions with lower shares of functional water points. Although the categorization into basins relates the data to a geographical rather than a political context, political influences might still be a major confounder, since the overlap between basins and regions is quite large most of the time.

##                basin
## region          Internal Lake Nyasa Lake Rukwa Lake Tanganyika
##   Arusha            1309          0          0               0
##   Dar es Salaam        0          0          0               0
##   Dodoma             827          0          0               0
##   Iringa               0       1582          0               0
##   Kagera               0          0          0             341
##   Kigoma               0          0          0            2816
##   Kilimanjaro        169          0          0               0
##   Lindi                0          0          0               0
##   Manyara           1206          0          0               0
##   Mara                 0          0          0               0
##   Mbeya                0       2430       1427               0
##   Morogoro             0          0          0               0
##   Mtwara               0          0          0               0
##   Mwanza               0          0          0              99
##   Pwani                0          0          0               0
##   Rukwa                0          0       1011             797
##   Ruvuma               0       1073          0               0
##   Shinyanga         1641          0          0            1072
##   Singida           1992          0          1               8
##   Tabora             641          0         15            1299
##   Tanga                0          0          0               0
##                basin
## region          Lake Victoria Pangani Rufiji Ruvuma / Southern Coast
##   Arusha                   32    2009      0                       0
##   Dar es Salaam             0       0      0                       0
##   Dodoma                    0       0    359                       0
##   Iringa                    0       0   3712                       0
##   Kagera                 2975       0      0                       0
##   Kigoma                    0       0      0                       0
##   Kilimanjaro               0    4210      0                       0
##   Lindi                     0       0     90                    1456
##   Manyara                   0     288      0                       0
##   Mara                   1969       0      0                       0
##   Mbeya                     0       0    782                       0
##   Morogoro                  0       0   1893                       0
##   Mtwara                    0       0      0                    1730
##   Mwanza                 3003       0      0                       0
##   Pwani                     0       0    784                       0
##   Rukwa                     0       0      0                       0
##   Ruvuma                    0       0    260                    1307
##   Shinyanga              2269       0      0                       0
##   Singida                   0       0     92                       0
##   Tabora                    0       0      4                       0
##   Tanga                     0    2433      0                       0
##                basin
## region          Wami / Ruvu
##   Arusha                  0
##   Dar es Salaam         805
##   Dodoma               1015
##   Iringa                  0
##   Kagera                  0
##   Kigoma                  0
##   Kilimanjaro             0
##   Lindi                   0
##   Manyara                89
##   Mara                    0
##   Mbeya                   0
##   Morogoro             2113
##   Mtwara                  0
##   Mwanza                  0
##   Pwani                1851
##   Rukwa                   0
##   Ruvuma                  0
##   Shinyanga               0
##   Singida                 0
##   Tabora                  0
##   Tanga                 114

4.1.4 gps_height

The first impression is that the higher a water point is located, the likelier it is to be functional. This is probably at least in part influenced by the far lower number of water points at great heights and might thus be a sampling effect. On the other hand, there might be confounding factors like lower usage, better climate conditions or outdoor tourism causing an increased share of functional water points.

4.1.5 amount_tsh

As for the gps_height variable, there might also be a sampling problem for amount_tsh. In this case this could actually be directly transferred into useful information. The tsh measure is probably not easily obtainable, maybe even only by professionals. Thus a measured tsh might indicate maintenance of the water point at the time of recording, causing the observed high fractions of functioning water points for entries with given amount_tsh values. (amount_tsh values were log10-transformed.)

4.1.6 construction_year

## [1] "non functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = -11.353, df = 52, p-value = 1.087e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9068931 -0.7446519
## sample estimates:
##        cor 
## -0.8441073 
## 
## [1] "functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = 11.746, df = 52, p-value = 3.008e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7572608 0.9118952
## sample estimates:
##       cor 
## 0.8522211 
## 
## [1] "functional needs repair"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = -0.89542, df = 52, p-value = 0.3747
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3784983  0.1494659
## sample estimates:
##        cor 
## -0.1232263

As one would expect, older water points are more often nonfunctional than newer ones. There seems to be a clear linear relationship between the construction year and the relative number of functional (r = 0.85) and nonfunctional (r = -0.84) water points. The share of water points that need repair is relatively steady (r = -0.12, p = 0.37), probably because it is a rather transient state.
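The correlation tests shown above relate the construction year to the yearly fraction of water points in each status group. A minimal sketch of that computation for the functional group, on invented toy data, is given below; the column `functional` and the exclusion of year 0 are assumptions made for this illustration.

```r
# Toy data: year 0 stands for missing values, statuses are invented.
wells <- data.frame(
  construction_year = c(0, 0, rep(c(1990, 1995, 2000, 2005, 2010), each = 4)),
  functional = c(FALSE, TRUE,
                 TRUE, FALSE, FALSE, FALSE,
                 TRUE, FALSE, FALSE, FALSE,
                 TRUE, TRUE, FALSE, FALSE,
                 TRUE, TRUE, TRUE, FALSE,
                 TRUE, TRUE, TRUE, TRUE)
)

# Drop entries without a known construction year.
known <- wells[wells$construction_year > 0, ]

# Fraction of functional water points per construction year.
frac_functional <- tapply(known$functional, known$construction_year, mean)
years <- as.numeric(names(frac_functional))

# Pearson correlation between year and functional fraction.
result <- cor.test(years, as.numeric(frac_functional))
print(result$estimate)
```

Because each year contributes a single fraction regardless of how many water points it contains, years with few entries carry the same weight as well-populated ones.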

4.1.7 extraction_type_class

This variable, like others, shows nicely that if a value is missing for an entry (here called ‘other’), the water point is very likely nonfunctional. Motor pumps also seem to break more often than other extraction mechanisms. Gravity and hand pumps, which are the most common extraction methods, are also the most reliable ones next to rope pumps.

4.1.8 quality_group

Fluoride-containing water, although quite uncommon, seems to correlate with a higher fraction of functional water points. This might again be due to better maintenance, if fluoride is added deliberately, which would probably happen mostly in tourist and richer areas where better care is taken of the water points. Note, however, that fluoride is added to water for dental health rather than as a disinfectant and can also occur naturally in groundwater, so this explanation is speculative. The maintenance hypothesis is at least consistent with the high fraction of nonfunctional water points among those with unknown water quality.

4.1.9 quantity_group

This variable correlates very closely with the status_group variable. Since dry water points are part of the definition of nonfunctional water points, it is not surprising that nearly all dry water points are labeled nonfunctional. Water points with no quantity description are mostly nonfunctional as well. Water points giving enough water, which are the majority, are functional more often than average.

4.1.10 source_type

Water points using natural water sources (rainwater, river/lake, spring) are more often functional than those using man-made sources (borehole, dam, shallow well). Surprisingly, and in contrast to the previous variables, the ‘unknown’ category has an about average fraction of functional water points.

4.1.11 waterpoint_type

The waterpoint_type variable illustrates clearly that, as long as a value is given, the water point is very likely to be functional. As in the univariate analysis, the frequencies given for dams are inconsistent with the ones given in the source variable. Since there are more entries labeled dam in source, those frequencies might be more representative.

4.1.12 funder_count

There does not seem to be a meaningful trend in the data. Companies funding few water points do not influence functionality in a better or worse way than organizations that fund a much higher number of water points. The one outlier is the government, whose funded water points show a lower than average fraction of functional water points.
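The funder_count variable describes each funder by the number of water points it funded. A minimal sketch of how such a count variable could be derived in base R is shown below. The funder names are invented, and the original derivation may have differed in detail, for example in how missing funders were handled.

```r
# Hypothetical funder names, invented for illustration.
toy <- data.frame(
  funder = c("Gov", "Gov", "Gov", "NGO A", "NGO A", "Church B"),
  stringsAsFactors = FALSE
)

# Count how many water points each funder appears with ...
counts <- table(toy$funder)

# ... and look that count up for every row by funder name.
toy$funder_count <- as.integer(counts[toy$funder])
toy
```

The same lookup would work for the installer variable to obtain installer_count.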

4.1.13 installer_count

As with funders, installer experience does not seem to influence functionality much. There is no consistent trend in the data.

4.1.14 management_group

Commercial management systems seem to be slightly better for water point functionality, although the margin is probably insignificant. Again the ‘unknown’ category correlates with a higher fraction of nonfunctional water points.

4.1.15 payment_type

The data shows that payment actually seems to help keep the water points functional. The vast majority of water points are free to use (which is good from an ethical point of view), but their fraction of functional water points is below average. Paid water points, especially those with regular payments, tend to be functional more often.

4.1.16 permit

Permitting water points does not seem to have a big influence on functionality. It might have a slight positive effect, but that is probably negligible.

4.1.17 population

It seems that a population of about 10-100 people around a water point is most favorable in terms of functionality. With no or only a few people around, the interest in maintaining the water point might be too low, while too many people might lead to overuse and a higher maintenance demand. Very high numbers of people living around a water point coincide with apparently high fractions of functional water points. Since there are few such water points overall, this might be a sampling artifact. But it might also hint at a confounding factor such as a higher interest in maintenance. Alternatively, the water point might be a lake, which because of its size alone has many people living around it and, since no mechanism is needed to retrieve water, is more likely to be functional.

4.1.18 public_meeting

Public meetings seem to have a slight positive effect on water point functionality.

4.1.19 scheme_management

State Water Commissions (SWC) do not seem to work very well as a management system for water points, but this scheme is apparently also rarely used. The best working management schemes are Water Boards, Water User Associations (WUA), trusts and private operators. Water points managed under the village water committee (VWC) scheme, the most common one, seem to have an about average fraction of functional water points.

4.2 Summary of the bivariate analysis

The date_recorded variable worked well as a sanity check, showing that the functionality data does not fluctuate with the recording time. But it probably won’t be useful as a predictor. The location of the water points seems to influence the fraction of functional water points. It is hard to say whether this is due to geographical properties of the respective regions, like their geographical height, or to political factors. The analysis clearly showed that a larger fraction of older water points is broken compared to newer ones, which makes construction_year a good predictor. Additionally, across most variables, values representing NAs seem to correlate with a high fraction of nonfunctional water points. Thus the use of the number of NAs as a predictor could be considered. The main idea behind using NAs as a predictor is that some metrics, like amount_tsh, might require expertise or equipment, so if the metric was measured, it is also more probable that the water point is maintained by professionals. Other, probably more useful predictors would be extraction_type_class, quality_group, quantity_group, payment_type (may be cleaned further by using the following labels: unknown, source_type, never pay, recurring pay, on demand pay), population, and scheme_management. Public_meeting, management_group, waterpoint_type and amount_tsh might also be useful to some extent. But installer_count, funder_count and permit would probably not greatly influence the prediction.

5 Multivariate Analysis

In this data set multivariate analysis becomes complicated very quickly, since most variables are categorical. Combining three categorical variables can result in a huge number of possible combinations, which makes visualizing and interpreting them very difficult. To streamline the exploration, the multivariate analysis focuses on three groups of variables that, based on the bivariate analysis, might be the best candidates for predicting water point functionality: location (region, basin, latitude/longitude, gps_height), age (construction_year) and waterpoint type (source_type, extraction_type_class, waterpoint_type).

5.1 Difference in water point characteristics depending on location

5.1.1 Influence of the location on functionality

Plotting every water point onto a map of Tanzania and coloring it by its status does not readily show whether there is a pattern in the water points’ locations that causes differences in functionality:

For better orientation in later maps, the mapped water points are colored by their region and basin, respectively, in the following plots:

Since the high number of water points makes it difficult to find patterns on the map, the water points will be binned by their coordinates and then plotted on the map. On the first map the fraction of functional water points per bin will be plotted. Bins with a value below 0.5 mostly contain nonfunctional water points and are colored red. Blue bins contain mostly functional water points.
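The binning described above can be sketched in base R as follows. The coordinates and status values are invented, and the 1 x 1 degree bin size is an assumption made for illustration; the bin size used for the maps in this report is not stated.

```r
# Toy coordinates and status, invented for illustration.
toy <- data.frame(
  longitude = c(30.1, 30.4, 30.8, 35.2, 35.6, 35.9),
  latitude  = c(-5.2, -5.7, -5.1, -9.3, -9.8, -9.1),
  status_group = c("functional", "functional", "non functional",
                   "non functional", "non functional", "functional")
)

# Assign every water point to a 1 x 1 degree bin (bin size is an assumption).
toy$lon_bin <- floor(toy$longitude)
toy$lat_bin <- floor(toy$latitude)

# Fraction of functional water points per bin.
toy$is_functional <- toy$status_group == "functional"
bins <- aggregate(is_functional ~ lon_bin + lat_bin, data = toy, FUN = mean)

# Bins below 0.5 contain mostly nonfunctional water points.
bins$mostly_functional <- bins$is_functional >= 0.5
bins
```

Each row of `bins` corresponds to one colored tile on the map, with `is_functional` as the value mapped to color.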

There are some areas that have a high number of bins that contain more nonfunctional water points than functional. Those overlap especially with the southern coast and the area between the lakes Rukwa and Tanganyika, which corresponds well to regions and basins that were identified before in the bivariate analysis. But there are also a lot of bins with mostly nonfunctional water points up the coast and in parts of the inland and coast of Lake Victoria.

5.1.2 Differences in median construction year between locations

A variable that showed good correlation with functionality in the bivariate analysis was the construction year. Plotted on the map, this data shows that there are three areas with mostly older water points. Two of those overlap well with the regions Mtwara, Lindi and Rukwa, which were already identified as regions with many nonfunctional water points. The south of the Singida region also has mostly quite old water points, although their density there is low. There are big areas with unknown construction_year values in the inland. Thus the occurrence of NAs might also be regional.

The map showed clusters of water points with a high median age. To see whether this is reflected in the region- and basin-variable, box plots showing the distribution of construction years by region/basin will be plotted. Additionally the differences in water point status will be analyzed:

Two of the regions containing areas with many old water points, Rukwa and Mtwara, indeed have the earliest median construction years. The water points of Lindi, on the other hand, have a fairly average median age. The regions Shinyanga, Pwani and Manyara, which have an above average fraction of functional water points, also have mostly newer water points. This is not true for Arusha and Iringa, which have the highest fraction of functional water points. Thus the construction year seems to play a more important role in some regions than in others. Differentiating by status group shows that nonfunctional water points have a higher median age than functional ones in every region. The difference between the two medians varies between regions.

The same is true when grouping by basins:

5.2 Dependency of extraction type used

5.3 The influence of the extraction method and age on the waterpoint status

New technologies are developed over time. Among the water extraction methods in Tanzania, rope pumps seem to have been introduced relatively recently. Hand pumps and submersibles also seem to have been less common in the sixties, but the early ones may already have broken down completely and thus not appear in the dataset. Interestingly, the status of hand pumps seems to be less age-dependent than that of other extraction methods. Water points without a reported extraction method have an earlier median construction year, which could explain at least part of the high fraction of nonfunctional water points in this group. Older wind-powered water points (1960-1990) seem to have nearly all broken down, so wind power might not be a good choice for a long-lasting water point. A similar but far less severe pattern can be seen for the other extraction methods as well, except maybe for hand pumps. Overall, hand pumps seem to be the most durable extraction method.

5.3.1 The influence of extraction method and geographical height on waterpoint status

Different extraction methods are preferentially deployed at different heights. While gravity-based, rope and wind-powered pumps are deployed at greater heights, hand, motor, submersible and other pumps are more common at lower heights. Wind-powered pumps seem to need a certain height, since there are none at all below ~600 m. Interestingly, for some extraction methods there also seems to be a connection between height and functionality. For wind-powered and maybe also rope and motor pumps, more nonfunctional pumps can be observed at greater heights. For water points with the ‘other’ extraction type, nonfunctional water points are mostly at lower heights than functional ones. For gravity, submersible and probably also hand pumps, height does not seem to influence functionality. A probable cause for more nonfunctional water points at greater heights for some extraction methods might be the greater distance over which the water has to be pumped.

5.3.2 Which extraction methods are used for which source types and what kinds of water points are used for these?

Extraction methods are not exclusively used for a specific source, but there is always one source that is mainly served by an extraction method. Extraction by gravity is mostly used for springs; hand pumps, rope pumps and other for shallow wells; motor pumps, submersibles and wind-powered pumps for boreholes.

Nearly every source type and extraction method mainly uses communal standpipes (single or multiple). Another very common water point type is the hand pump. There also seems to be some ambiguity, since not every water point with the extraction method hand pump has the water point type hand pump, especially for the sources river/lake and spring. This kind of ambiguity might hurt prediction. For some source types (borehole, other, shallow well), if the extraction method is not specified, the waterpoint type is usually not specified either. For dams, rainwater harvesting and river/lake, water points with unknown extraction type are mostly communal standpipes. For springs they are mostly of the type improved spring, which is a vague term. This observation could help to impute NAs in these categories.

5.3.3 Improved functionality by political measures

5.3.3.1 Permits

While permitting does not make a difference for some extraction methods, for others, namely wind-powered, rope pumps and motor pumps, permitting increases the fraction of functional water points.

5.3.3.2 Public meetings

Similar to permits, public meetings go along with improved functionality for water points using specific extraction methods, namely gravity, hand pumps and rope pumps.

5.3.3.3 Payment types

Most extraction methods seem to be influenced little by payment type, with never pay and unknown being the worst payment types in terms of keeping water points functional. But for motor pumps, submersibles and wind-powered pumps, irregular pay seems to go along with a lower fraction of functional water points.

5.4 Using NAs as a predictor

One observation made during the analysis of the given variables was that NAs, or values representing them, mostly corresponded with a higher fraction of nonfunctional water points. Thus a new variable will be created here giving the number of NAs of the respective entry. As a first step NAs have to be identified and labeled accordingly. Since longitude and latitude are both part of one coordinate, if one is NA the other will be as well. To avoid over-representing the NAs, the longitude variable will be removed here.

Then the different values representing NAs are transformed to NA. In numeric variables the NA value is 0 or -2e-08. This can be problematic, since 0 could actually be the measured value. For this analysis it will be assumed that this is not the case. Since 0 is the most frequent value in the numeric variables, it is very likely that it is meant as NA. In categorical or logical variables, it can be ‘0’, ‘-’, ‘other’, ‘Other’, ‘Others’, ‘’, ‘Unknown’ or ‘unknown’.
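The recoding step could be sketched as follows. The toy rows are invented; the placeholder values are the ones listed above. The exact comparison against -2e-08 works here because the toy value is the same literal, whereas real data might need a tolerance-based comparison.

```r
# Toy rows, invented for illustration; column names follow the data set.
toy <- data.frame(
  amount_tsh = c(0, 50, 0),
  latitude = c(-2e-08, -6.5, -3.2),
  funder = c("0", "Gov", ""),
  management_group = c("unknown", "user-group", "other"),
  stringsAsFactors = FALSE
)

# Placeholder values treated as missing, as listed in the text.
na_numeric <- c(0, -2e-08)
na_text <- c("0", "-", "other", "Other", "Others", "", "Unknown", "unknown")

# Replace placeholders column by column, depending on the column type.
for (col in names(toy)) {
  placeholders <- if (is.numeric(toy[[col]])) na_numeric else na_text
  toy[[col]][toy[[col]] %in% placeholders] <- NA
}
toy
```

Treating every 0 as missing is the assumption stated above; a variable in which 0 is a legitimate measurement would need to be excluded from this loop.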

The total number of NAs is 272622 of 2376000, which is 11.47 %. That is quite a lot. Next a new variable is created containing the number of NAs of the respective entry.
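Counting the NAs per entry and computing the overall NA share can be sketched like this. The toy data frame is invented; the name `na_count` for the new variable is an assumption, since the report does not name it. The last line only re-computes the percentage from the two totals given above.

```r
# Toy data with NAs already recoded, invented for illustration.
toy <- data.frame(
  amount_tsh = c(NA, 50, NA),
  funder = c(NA, "Gov", "NGO A"),
  permit = c(TRUE, NA, NA),
  stringsAsFactors = FALSE
)

# Overall share of NA cells in percent (computed before adding the new column).
na_share <- 100 * sum(is.na(toy)) / (nrow(toy) * ncol(toy))

# New variable: number of NAs in each entry (row).
toy$na_count <- as.integer(rowSums(is.na(toy)))
toy

# Check of the percentage reported in the text.
reported_share <- round(100 * 272622 / 2376000, 2)
```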

## [1] "non functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = 2.4171, df = 24, p-value = 0.02361
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06651741 0.70840622
## sample estimates:
##       cor 
## 0.4424689 
## 
## [1] "functional"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = -1.7575, df = 24, p-value = 0.09158
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.64116390  0.05715152
## sample estimates:
##        cor 
## -0.3376758 
## 
## [1] "functional needs repair"
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(wideSubData$independent) and wideSubData[[paste0("Freq.", i)]]
## t = -5.0813, df = 24, p-value = 3.383e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8658175 -0.4611456
## sample estimates:
##        cor 
## -0.7199042

With a higher number of NAs in an entry, the fraction of nonfunctional water points tends to increase (r = 0.44, p = 0.024). The corresponding decrease in the fraction of functional water points (r = -0.34, p = 0.092) is not significant at the 5 % level, while the fraction of water points needing repair decreases markedly (r = -0.72). The fit gets worse at higher NA counts, but this is probably due to the low number of data entries with that many NAs. The median number of NAs per data entry was binned and plotted on the map. In the southeastern part of Tanzania low numbers of NAs correlate well with a higher fraction of functional water points. This is not the case in the northwestern part. For example, in the Rukwa region a high fraction of water points is nonfunctional, but there are only a few NAs in their data entries. The map also shows that in the northeastern inland the bins contain data entries with high numbers of NAs, which was already suggested by the missing bins on the maps for population and gps_height.

6 Final Plots and Summary

6.0.1 Plot One

6.0.2 Description One

The oldest water points recorded in the data set were built in 1960. Since they were more than 50 years old at the time of recording, one would expect that a considerable fraction is not functioning anymore, especially compared to water points that were built later, e.g. in the 2000s. To check this expectation, the fractions of each status (‘functional’, ‘functional needs repair’ and ‘non functional’) were calculated for each construction year and plotted. A linear model was fitted through the points to visualize the trend. The plot shows clearly that older water points are far more often nonfunctional than newly built ones, and the data fits a linear relationship quite well. The ‘functional needs repair’ status is hard to interpret, since there are only very few data entries in this class and since it is a rather transient state by nature, but it also seems to be less common for more recently built water points.
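Fitting a line through per-year fractions can be sketched with `lm`. The fractions below are invented and are not results from this report; they only illustrate how the slope of such a trend line is obtained.

```r
# Invented per-year fractions of functional water points (not real results).
frac <- data.frame(
  construction_year = c(1960, 1970, 1980, 1990, 2000, 2010),
  functional = c(0.30, 0.35, 0.45, 0.50, 0.60, 0.65)
)

# Least-squares line through the points.
fit <- lm(functional ~ construction_year, data = frac)

# Slope: estimated change in the functional fraction per construction year.
slope <- unname(coef(fit)["construction_year"])
slope
```

A positive slope means that the fraction of functional water points rises with the construction year, that is, newer water points are functional more often.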

6.0.3 Plot Two

6.0.4 Description Two

Exploration of the dataset showed that there are regional differences: in some areas of Tanzania the fraction of nonfunctional water points is higher than in others. This is probably caused by a multitude of confounding factors. Since the construction year was established as a strong influence on the functional status of the water points, it could be one of the factors leading to these regional differences. To visualize this, the water points’ locations were plotted on a map of Tanzania. Since the high number of data points would make it hard to detect patterns, bins defined by longitude and latitude were created, and the median age of all water points in each bin was calculated and visualized by color hue. The map shows clearly that the median age of the water points is higher in some areas than in others. Comparing these areas to regions with a relatively high number of nonfunctional water points reveals an overlap with some of those regions, namely Mtwara, Lindi, Rukwa and Singida. While these are not all of the regions showing a below average fraction of functional water points, the construction year is very likely one of the factors behind the regional differences. Thus some regions should consider modernizing their water points to improve the coverage with functional places to get water.

6.0.5 Plot Three

6.0.6 Description Three

One of the most obvious observations in the dataset was that there is a high number of diversely labeled NAs and that water points having these values were often more likely to be nonfunctional. To check this observation quantitatively, the NAs of each data entry were counted and added as a new feature of the dataset. The fractions of the different statuses were calculated for each observed NA count and plotted. The plot supports the initial observation: the more NAs a data entry has, the more likely the water point is to be nonfunctional.


7 Summary and Reflection

The data set used in this exploration is very untidy. Many variables are redundant, providing the same information at different levels of detail. In choosing which variables to keep I tried to strike a compromise: retain as much information as possible while keeping the data set as clean and non-redundant as possible. Some variables have a huge number of levels, which makes them impractical for prediction. For the variables representing companies, I tried to describe each company by its experience, substituting the variable with the number of water points the company worked on. This gave me continuous variables to work with, but they did not add much information. The other variables with too many levels are regional descriptors. Since the coordinates and region names already provide this kind of information, those variables were removed from the data set. But since they originate from administrative divisions, it might be interesting to find out at which administrative level water distribution is regulated.

The analysis showed that the geographic location of a water point is very important for predicting its status. There are several issues with this. The political regions are of course not the direct cause of a water point’s status, but they stand in for several confounders, like tourism or wealth. Using data on such factors directly would probably improve a model quite a bit, even more so for lower-level political divisions, since these provide a higher geospatial resolution. The most indicative variable next to location, and one that also causes part of the regional differences in water point status, is the construction year, which shows a linear relationship with the fraction of functional water points. The extraction method used depends on the geographical height and the construction year.

Although an extraction method is not exclusive to a source type, each method has a strong preference for one source type. Political measures to improve water point functionality (permits, public meetings and payment) influence water points with different extraction methods differently. While public meetings can be beneficial for some methods, they seem to do nothing for others or might even be detrimental to overall functionality. Additionally, the number of NAs in a data entry relates to the water point’s status. This might be because poor record keeping goes along with poor maintenance. But one should be careful when using variables related to data recording for prediction, since they might introduce a bias into the model. At this point the observation might rather be used to make visible the insufficient maintenance of water points, which might be part of the problem, and to point to a way of improving the situation.

There are many other features that show an influence on the status of the water points and that were not analyzed as thoroughly. Thus there might be more to find in this data set. For example, motor pumps seem to be an unreliable way to get water. In this exploration the status ‘functional needs repair’ was barely discussed, because influences on it are very difficult to detect. Since this state is rather transient and lies between the other two states, this might be difficult in any case. Although it would be very useful to detect the need for repair as early as possible, with so many water points nonfunctional the more urgent task for now is to find the causes and ways to repair them.